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Abstract A fully real-time coherent dedispersion system has been developed for the 
pulsar back-end at the Giant Metrewave Radio Telescope (GMRT). The dedispersion 
pipeline uses the single phased array voltage beam produced by the existing GMRT 
software back-end (GSB) to produce coherently dedispersed intensity output in real 
time, for the currently operational bandwidths of 16 MHz and 32 MHz. Provision 
has also been made to coherently dedisperse voltage beam data from observations 
recorded on disk. 

We discuss the design and implementation of the real-time coherent dedispersion 
system, describing the steps carried out to optimise the performance of the pipeline. 
Presently functioning on an Intel Xeon X5550 CPU equipped with a NVIDIA Tesla 
C2075 GPU, the pipeline allows dispersion free, high time resolution data to be ob¬ 
tained in real-time. We illustrate the significant improvements over the existing in¬ 
coherent dedispersion system at the GMRT, and present some preliminary results 
obtained from studies of pulsars using this system, demonstrating its potential as a 
useful tool for low frequency pulsar observations. 

We describe the salient features of our implementation, comparing it with other 
recently developed real-time coherent dedispersion systems. This implementation of 
a real-time coherent dedispersion pipeline for a large, low frequency array instrument 
like the GMRT, will enable long-term observing programs using coherent dedisper¬ 
sion to be carried out routinely at the observatory. We also outline the possible im¬ 
provements for such a pipeline, including prospects for the upgraded GMRT which 
will have bandwidths about ten times larger than at present. 
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1 Introduction 

The interstellar medium of our Galaxy, consisting of large regions of ionized gas with 
free electrons, can be modeled as a cold tenuous plasma which alters the character¬ 
istics of any electromagnetic wave passing through it. Pulsar signals are ideal tools 
to study these effects because of the pulsed nature of their signal, which allows one 
to probe the effects of dispersion and scattering. Furthermore, the compact nature 
of the source allows one to observe the effects of scintillation. On the other hand, 
the same dispersion and scattering effects can be a hindrance to the detection and 
detailed study of short period pulsars, especially at low radio frequencies and typical 
Galactic distances (see and ifTSll for a review of these effects). This becomes very 
important for low frequency radio telescopes such as the GMRT 


1.1 Pulsar observations with the GMRT 

The GMRT located in Khodad near Pune, India is a multi-element aperture synthesis 
telescope consisting of 30 antennas, each with a diameter of 45 metres, and spread 
over a region of 25 km diameter ( EOlO . It currently operates in 5 frequency bands 
centered around 150 MHz, 235 MHz, 325 MHz, 610 MHz and 1000-1450 MHz, with 
a maximum available bandwidth of 32 MHz. The high sensitivity of the GMRT at low 
frequencies, as well as its wide frequency coverage with reasonably large bandwidths 
makes it a powerful instrument for pulsar studies. Pulsar observations can be carried 
out in the array mode in two ways: the Incoherent Array (lA) mode or the Phased 
Array (PA) mode(||9l). 

In the lA mode, the power signals from each of the antennas are added to produce 
the final total intensity signal, whereas in the PA mode, voltage signals from the 
antennas are phased, added and then detected to produce the final signal. The intensity 
signals in both these modes have 256 or 512 spectral channels, leading to minimum 
time resolutions of 16 and 32 microseconds, respectively. These can be integrated 
to produce the desired time resolution, and then recorded on disk as intensities in 
multiple spectral channels. The PA mode also has a voltage beam mode where the 
raw voltage from a single phased array beam can be recorded on a disk, as the real 
and imaginary parts from the output of a Fast Fourier Transform (FFT) on the original 
voltage time series, with 256 or 512 spectral channels. 


1.2 Dedispersion techniques 

Pulsar observations at low frequencies are distorted by the deleterious effects of dis¬ 
persion, which can limit the utility of such an instrument. When an electromagnetic 
wave travels through the ionized interstellar medium, low frequency components of 
the signal travel slower than the higher frequency components due to dispersion, 
which causes a broadband sharp pulse from the source to be smeared out in time 
when detected with a receiver having a finite bandwidth. This effect becomes more 
significant at lower frequencies (as the dispersion delay for a frequency / increases 
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as jj), and hence proper correction techniques are vital for pulsar observations in this 
regime. 

The correction for dispersion involves accounting for the frequency dependent 
velocity of the electromagnetic waves in the interstellar medium, and applying ap¬ 
propriate time delays, or alternatively, phase corrections in the frequency domain to 
correct for dispersion. This leads to two techniques for implementing such modifi¬ 
cations ; incoherent and coherent dedispersion ( IITtI '). In the former, the observable 
bandwidth is split into a number of smaller channels, and the signal is detected sep¬ 
arately in each of the channels. Appropriate time delays are then applied to align the 
pulses in the separate frequency channels, which are then added to produce a dedis- 
persed time series. It is evident that this technique does not remove the intra-channel 
dispersion time, which can only be reduced by using a larger number of channels 
across the bandwidth. However, this comes at the cost of reduced time resolution 
since the Nyquist time resolution of a channel of width 5v is 

Coherent dedispersion ( ifTTIl ') can provide significant improvements over incoher¬ 
ent dedispersion by completely removing the effects of interstellar dispersion, and 
recovering the original Nyquist time resolution of the sampled voltage signal. This is 
done by deconvolving the received voltage signal at the telescope with the transfer 
function of the interstellar medium, which is treated as a linear filter. The length of 
the impulse response to be deconvolved is roughly proportional to the inverse cube of 
the frequency of observation, linearly proportional to the Dispersion Measure (DM) 
along the line of sight, and has a quadratic dependence on the bandwidth of observa¬ 
tion. Since convolutions are efficiently computed in the frequency domain by using 
efficient FFT algorithms, it is evident that implementing such a technique at low fre¬ 
quencies and large bandwidths is a challenging task due to the computation of the 
long FFTs (the length of which has to be larger than the impulse response) that are 
involved in the process. 

The inaccuracies of the incoherent dedispersion technique can be a significant 
drawback for observations at low frequencies requiring high time resolution. Co¬ 
herent dedispersion is thus very useful for pulsar observations with low frequency 
telescopes such as the Giant Metrewave Radio Telescope (GMRT ll20l ). the Long 
Wavelength Amay (LWA ||2T1), the Low Frequency AiTay (LOFAR Ell ') and the 
Murchison Widefield Array (MWA EtI ). 

As a demonstration of the merits of the coherent dedispersion technique, we 
present incoherently and coherently dedispersed folded profiles of the pulsar B1 937 h- 2 1, 
observed with the GMRT at 325 MHz and dedispersed with our offline coherent 
dedispersion system, in Figure Since this pulsar has a relatively large DM of 71 
pc/cc and a very short period of 1.55 ms, the effects of intra-channel dispersion time 
are clearly evident in terms of smearing of the folded profile in the incoherently dedis¬ 
persed case. On the other hand, the coherently dedispersed profile shows sharper 
pulses and a clear indication of a scattering ‘tail’ (described in Section [T~4) i. We de¬ 
scribe the details of the coherent dedispersion technique in Section]^ 
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Folded profile 



(a) Incoherently dedispersed and normalized folded profile for B1937+21. 


Folded profile 



(b) Coherently dedispersed and normalized folded profile for B1937+21. 

Fig. 1 Incoherently and coherently dedispersed normalized folded profiles of the pulsar B1937+21 ob- 
sei-ved with the GMRT at 325 MHz, with a bandwidth of 16 MHz for a duration of 10 minutes. The 
normalization has been done such that the entire pulse profile intensity range extends between 0 and 1. 
Our offline coherent dedispersion system has been used to dedisperse the raw voltage data. The pulsar has 
a period of 1.55 ms and a DM of 71 pc/cc. 


1.3 Real-time pipelines for coherent dedispersion 

Coherent dedispersion necessarily requires one to process the original Nyquist sam¬ 
pled voltage signal from the phased array beam. Since this voltage data cannot be 
integrated, it either has to be stored on a disk for offline processing, or processed 
in real-time. Recording voltage data in such cases can require very large amounts 
of disk space (particularly for long term programs such as high precision timing), 
and may be a difficult task as one goes for larger bandwidths of observation. For 
example, with the current GMRT software back-end l lfT^ i. the data recording rate 
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is 120 GB/hour in the 16 MHz mode and 240 GB/hour in the 32 MHz mode of ob¬ 
servation. For the 200 MHz bandwidth of the upgraded GMRT wideband back-end, 
the data recording rate required would be 1500 GB/hour, which would be impracti¬ 
cal for recording followed by offline processing, especially for long term observing 
programs. The data rates would be similar for other low frequency telescopes with 
similar bandwidths. For higher frequency telescopes like the Green Bank Telescope 
OfTTlO where the bandwidths may be larger, the data rates would be proportionately 
higher. 

Implementing coherent dedispersion in real-time provides a solution to overcome 
these drawbacks. However, the process of coherent dedispersion itself is computa¬ 
tionally expensive (especially at low radio frequencies), requiring the computation of 
large FFTs (often longer than a million points), and hence requires a large amount 
of computing power to implement in real-time. An effective way to reduce the com¬ 
putational load of the process is dividing the bandwidth of the receiver into a num¬ 
ber of channels, coherently dedispersing each sub-band, followed by alignment of 
the individual sub-bands in time (the individual sub-bands thus do not contain any 
intra-channel dispersion). Using smaller sub-bands reduces the length of the impulse 
response to be convolved with, thereby reducing the required length of the FFTs and 
the overall computational load. However, this comes at the cost of reduced time reso¬ 
lution, which is now limited by the inverse of the bandwidth of the individual smaller 
sub-bands. 

One of the first real-time coherent dedispersion systems was developed in the 
mid-1980s by Hankins and Rajkowski dflOl ') at the Arecibo observatory for a mod¬ 
est bandwidth of 1.5 MHz centered at a frequency of 1414 MHz using a hardware 
based implementation. Before this, coherent dedispersion was most commonly im¬ 
plemented by recording data on magnetic tape, followed by offline processing in 
hardware / software based systems. The recent emergence of parallel computing on 
CPU-based clusters and GPUs has provided a powerful tool to realize the associated 
computational challenges and develop real-time coherent dedispersion systems. 

The PuMa-II back-end developed at the Westerbork Synthesis Radio Telescope 
(WSRT ifTSll ') demonstrated the use of a CPU cluster to implement a near real-time 
coherent dedispersion system for a bandwidth of 160 MHz, by dividing the band¬ 
width into 8 sub-bands, each with a bandwidth of 20 MHz, and then dedispersing and 
combining the signal. The system first records the baseband signal into a cluster of 
disks, which is then processed with the a 32 node cluster in near real-time. It has been 
implemented with the code available from the open-source software library Digital 
Signal Processing Software for Pulsar Astronomy (DSPSR 123]). The DSPSR was 
developed for pulsar observations at the Parkes telescope and allows coherent dedis¬ 
persion to operate in multiple modes involving both CPUs and GPUs. As with the 
dedispersion system in PuMa-II, DSPSR implements coherent dedispersion by form¬ 
ing a filter-bank of smaller bandwidths, which are coherently dedispersed, detected 
and aligned in time. 

The Green Bank Ultimate Pulsar Processing Instrument (GUPPI 15] ), functioning 
on FPGAs, a CPU cluster and a GPU cluster of 8 GPUs includes a wide bandwidth 
coherent dedispersion system supporting a total bandwidth of 800 MHz. The total 
bandwidth is divided into 8 smaller sub-bands, each of which can be processed in 
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one GPU of the 8 GPU cluster. The Nancay telescope has recently developed a real¬ 
time GPU-based coherent dedispersion back-end (lUl). A GPU cluster consisting of 
4 NVIDIA GeForce 8800 GTX cards is used to coherently dedisperse a bandwidth of 
128 MHz (after sub-banding into four 32 MHz bands), at frequencies in the L-band. 
The range of FFT lengths implemented by these systems (typically of the order of 1 
million points) are however much smaller than that required for dedispersing signals 
at low frequencies with similar bandwidths and DMs. 


1.4 Limitations due to scattering 

Scattering occurs due to electron density fluctuations in the interstellar medium, 
which causes part of the signal from the source to be deflected from its original path 
of propagation and travel along longer nearby paths. In the case of pulsars, this time 
delay causes narrow pulses to be spread out into an exponential ‘tail’ (see ifTSl l. The 
effective time resolution obtainable from coherent dedispersion is thus limited by 
scatter broadening in the interstellar medium which, unlike dispersion, cannot be re¬ 
moved by post-processing techniques. Hence, it is a useful exercise to estimate the 
best time resolution obtainable in the presence of scattering. 

To estimate the effect of scattering for the frequency bands in use at the GMRT, 
we use the empirical relation determined by Bhat et al. (JJ)) relating the scattering 
time Tscat to the dispersion measure (DM) of the pulsar. We compare the estimated 
scattering time iscat as a function of DM for a hxed observing frequency (center 
frequencies of 325 MHz and 610 MHz) with the intra-channel smear time for inco¬ 
herent dedispersion using 256 channels (as implemented at the GMRT) for a fixed 
bandwidth of observation. The results are shown in Figure]^ for two sets of param¬ 
eters. Note that the Xscat - DM relation has a signihcant scatter in observations (the 
plots show a scatter of one order of magnitude, which is typically observed El), and 
hence the plots are only meant to give approximate estimates. 

As can be seen from Figure]^ for small values of DM, the scattering time is sig¬ 
nificantly smaller than the intra-channel dispersion time for incoherent dedispersion. 
As a result, coherent dedispersion can be useful in this regime for obtaining better 
time resolution. The scattering curve still provides a lower limit to the obtainable 
time resolution (instead of the Nyquist resolution) after coherent dedispersion has 
been performed. However, above a particular value of DM (depending on the fre¬ 
quency of observation), the scattering time dominates over the intra-channel smear 
time, and hence coherent dedispersion would provide no improvement over incoher¬ 
ent dedispersion in this regime. For 610 MHz observations, this limiting value of the 
DM is around 100 pc/cc, and it is a bit less than that for 325 MHz observations. 


2 Coherent dedispersion 

2.1 ISM transfer function 

The action of the interstellar medium on a propagating electromagnetic wave can 
be considered as that of a linear filter with a transfer function H{f), where / is the 
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Fig. 2 Intra-channel smear time in 256 channel incoherent dedispersion and scattering time as a function 
of DM for a given observing frequency. The black curves correspond to the approximate smearing time 
due to scattering only (note that the relation has a significant scatter and hence these curves only give an 
approximate estimate). The green dashed lines above and below the scattering curve show a confidence 
interval of one order of magnitude in the scattering time, which is typically observed (II). The red and 
blue curves correspond to the intra-channel smear time in incoherent dedispersion only, for bandwidths of 
16 MHz and 32 MHz respectively. The channel widths for the 16 MHz and 32 MHz modes are 62.5 KHz 
and 125 KHz respectively. 


observing frequency. Let Vo(/) and Vr{f) be the Fourier transforms of the electro¬ 
magnetic wave emitted at the pulsar and that received at the telescope respectively. 
Considering a real-sampled signal, let the bandwidth of observation extend from 
a frequency of to fo -I- Af, where fo is the lower edge frequency of the band, 
and Af is the bandwidth. If // be the deviation from the lower edge of the band 
so that fi=f — fo, then the expression for the received wave can be expressed as 
Vr(/o + //) = Voifo + fi)H{fo + fi). The phase picked up by an electromagnetic wave 
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travelling though the ISM is given by (see M), 


^{f) = 2nf- 
c 



( 1 ) 


where L is the distance travelled in the ionized plasma, fp is the plasma frequency of 
the medium and c is the speed of light. Hence, the transfer function can be written as 
H{fo + fi) = which, using Equationj^ can be rewritten as 


^InL 

H{fo+fi)=e ^ 




fi- 


fl 

jp 




( 2 ) 


The first term in the square brackets is a constant phase offset which is lost in de¬ 
tection, while the second term is a linear frequency gradient which corresponds to 
a time delay in the arrival of the pulse. The third term causes the dispersion within 
the band, and needs to be corrected for. Hence the transfer function to correct for the 
dispersion within the band is the inverse of the third term in Equation]^ In terms of 
the dispersion measure of the pulsar, the third term can be expressed as, 

Hifo+fi)=eW)J^ (3) 

where D is a constant equal to 4.148808 x 10® cc/pc MHz. 


2.2 ISM impulse response and its deconvolution 

The impulse response associated with the transfer function in Equation [^has been 
numerically computed for two sets of parameters and are shown in Eigure^(a), with 
the normalized amplitude of the response plotted as a function of time. The necessary 
parameters in this case are the observing frequency, bandwidth of observation and the 
DM along the line of sight to the pulsar. The length of the response is equal to the 
dispersion time in the observing band. 

The discontinuous termination of the transfer function in the frequency domain 
(corresponding to multiplication of the transfer function with a rectangular window, 
with a width equal to the bandwidth of the receiver) produces ripples in the time 
domain impulse response. These ripples can produce unreal artifacts in the data, and 
hence need to be eliminated with the use of an appropriate window or taper function 
in the frequency domain. The modified impulse response for two well known window 
functions, the Hanning and Welch windows are shown in Eigure[^(b), which show 
reduced levels of oscillations, however, at the cost of reduced sensitivity to the edges 
of the band. The taper function multiplied with the inverse transfer function is known 
as the chirp function. 

Deconvolution of the ISM impulse response requires proper book-keeping of 
edge effects to prevent aliasing in the computed time series. The overlap-save method 
is a standard method of computing long convolutions, as described in IJl, where the 
number of samples corresponding to the length of the impulse response is discarded 
from the beginning of the output of a time series segment which has been convolved, 
while the same number of samples at the end is overlapped with the next such seg¬ 
ment in the data to maintain continuity in the time series. 
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ISM Impulse Response Amplitude 



(a) Normalized Impulse Response Amplitude. 


ISM Impulse Response Amplitude for 2 Window Functions 



(b) Effect of applying a window function. 

Fig. 3 The first panel shows the normalized impulse response amplitude for a center frequency of 610 
MHz, bandwidth of 32 MHz and DMs of 20 pc/cc (red) and 40 pc/cc (blue). The second panel shows the 
impulse response when a taper function is applied for the case of two windows: the Hanning (red) and 
Welch (blue) windows for the same frequency parameters and DM of 40 pc/cc. 


2.3 Implementation for the GMRT 

At the GMRT, the voltage beam data is recorded as 8-bit integers from the output of a 
FFT performed on the original sampled voltage from the phased array. Consequently, 
one needs to perform an inverse FFT on the beam data to get back the original time 
series, and then perform a longer FFT to implement the deconvolution algorithm 
of coherent dedispersion. The length of the long FFT (designated with the letter ‘N’ 
henceforth) has to be at least greater than the number of samples corresponding to the 
length of the impulse response, i.e., the dispersion time in the band. Since FFTs can 
be efficiently computed if the length is a power of 2, we choose the number of samples 
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to be the smallest power of 2 greater than twice the number of samples corresponding 
to the dispersion time. The phase changes are then applied to the spectral output from 
the long FFT, and the overlap-save method is used to get back dedispersed data in the 
time domain. The final algorithm implemented for the GMRT is schematically shown 
in Figure]^ 



Fig. 4 The coherent dedispersion algorithm for the GMRT. Here ‘D’ corresponds to the number of points 
to be overlapped for implementing the overlap-save algorithm, i.e. the length of the impulse response. 


3 Optimizing a CPU based implementation 

Our initial attempts towards developing a coherent dedispersion system for the GMRT 
were centered around designing an optimized CPU-based implementation of the al¬ 
gorithm (see ifTSll for a cluster based CPU implementation) and comparing its per¬ 
formance with respect to that required for a real-time pipeline. We report the steps 
involved in implementing and optimizing such a CPU-based coherent dedispersion 
system, as well as our results on its performance as compared to real-time process¬ 
ing rates. The computing resources used include an Intel Xeon X5550 CPU @ 2.67 
GHz with 4 cores (supporting a maximum of 8 threads simultaneously) with on board 
RAM of 4 GB. 

We have based our CPU implementation on the Fastest Fourier Transform in the 
West (FFTW) package (Gl), which provides optimized routines to perform forward 
and inverse FFTs on sampled real data. All our routines have been coded using the 
double precision floating point format, since we found that the use of single precision 
resulted in clearly visible artifacts in the dedispersed time series. We believe that 
this is due to the fact that the single precision floating point FFTW routines produce 
increasingly inaccurate results (which are much larger than those for double precision 
computations) when computing longer FFTs (see ©). See Section [4~T| for an example 
of this effect. 


3.1 Parallelizing the algorithm 

Given the different steps involved in the dedispersion algorithm, the first step towards 
optimization would involve dividing the algorithm into different and independent 
sections which can be executed in parallel on a multiple core processor. However, as 
shown in Figure]^ each of the steps in the algorithm is dependent on the earlier step. 
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and hence one is required to proceed by performing the different steps in parallel, but 
on consecutive blocks of data. In this case, the blocks of data correspond to each of 
the long N point segments on which the FFT is to be performed. 

We have designed a scheme where the process has been divided into 3 different 
sections, such that the respective routines can process different blocks of data and 
then transfer these to the next process in the pipeline via shared memory. The division 
of the process into three different sections follows logically from the schematic shown 
in Figure]^ with the individual sections (processes) corresponding to: 

- Inverse transforming the voltage FFT data 

- Performing coherent dedispersion on the voltage time series 

- Detecting and writing the dedispersed intensity signal to a disk 

Seven of the eight available threads on the system are used. The number of threads 
assigned to each process has been selected based on tests to determine the best config¬ 
uration to balance the computational load of different routines. The optimal algorithm 
derived is depicted in Figure]^ 



Fig. 5 An optimized CPU based coherent dedispersion algorithm. Block i is the latest block from the file 
/ shared memory. Each of the routines highlighted in yellow run in parallel, utilizing a certain number of 
threads (highlighted in red). Their tasks are depicted in the respective boxes indicated with dashed lines. 
The processing diagram is for one of the circularly polarised voltage beam signals. 


The details of the individual routines are as follows: 

Inverse transform (Block i): The spectral data from the GSB is from a 1024 point or 
2048 point FFT on the original Nyquist sampled voltage data. The GSB is however 
configured to output this data in larger blocks, consisting of a number of these FFT 
outputs in series. Hence, to improve the processing rate, we use 2 threads on the pro¬ 
cessor to simultaneously inverse transform 2 halves of a given block into a continuous 
time series. The data is then written into a ping-pong scheme double buffer in shared 
memory. The use of the double buffer in shared memory allows the next routine in 
the pipeline to read from a shared memory segment, while the other is being simulta¬ 
neously written into by the current routine. 

Coherent dedispersion (Block i-1): The chirp function is initialized in this routine, 
and the voltage time series (block i-1) from the inverse transform is read from the 
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shared memory double buffer. Multi-threaded FFTW using 4 threads is used to com¬ 
pute the long N-point FFT, the chirp function is applied and then the data is inverse 
transformed into a time series. This time series is written to a single shared memory 
segment for the next step of processing. 

Write to Disk (Block i-2): This routine (using 1 CPU thread) reads the deconvolved 
voltage data (block i-2) written to a shared memory segment by the coherent dedisper¬ 
sion routine, squares it, and integrates it to a given number of samples. This intensity 
output is written to a file on the disk. 


3.2 Performance of the CPU-based pipeline 


The algorithm has been characterized by measuring the processing rate of the inverse 
transform and coherent dedispersion routines (which perform the maximum amount 
of computation), as well as of the entire pipeline itself, as a function of the length 
of the dedispersion transform (twice the number of samples corresponding to the 
dispersion time in the band). We have performed the tests by simulating random data 
for the voltage stream, storing it in the shared memory of the system in a structure 
identical to real-time observations, followed by processing the data with the routines 
described above. Hence, these tests did not involve any disk access and are not I/O 
dominated. 

The lengths of the dedispersion transforms bench-marked are typical lengths of 
the transforms (chosen to be powers of 2) that are required for the observing frequen¬ 
cies and bandwidths available at the GMRT The rate has been expressed as a multiple 
of the real-time data acquisition rate and in terms of the number of GFLOPs achieved 
in run-time. The rate of a routine must be higher than 1 to be able to process data in 
a real-time application. The number of GFLOPs has been estimated by calculating 
the number of floating point operations for each of the two routines. For the Inverse 
Transform routine, this is given by: 

2.5 *A^IogA^ 

GFLOPs =- — (4) 

tns 


where N is the length of the original FFT on the voltage time series and tns is the time 
in nanoseconds taken to compute the inverse transform. For the Coherent dedisper¬ 
sion routine, the same is given by: 


GFLOPs = 


5*N'\ogN' + 3N' 

ins 


(5) 


where N’ is the length of the long FFT to be computed for coherent dedispersion and 
t„s is the time taken in nanoseconds to perform the computations. Here the the first 
term in the numerator corresponds to the computation for the forward and inverse 
FFTs, while the second term comes from applying the complex phases to the ^ 
frequency channels. 

The results are shown in Figures and |7] It is evident that the processing rate 
of the optimized CPU-based implementation is significantly below that required for 
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CPU based implementation 

Processing rates of the routines in 16 MHz mode 



1 ■ 1 ■ 1 • 1 


- 


Full pipeline 

- 


—Coherent dedispersion 

Real-time processing 

“ 

—Inverse Transform 



/N 


-^ 


2.0x10’ 4.0x10’ 6.0x10’ 8.0x10’ 1.0x10* 1.2x10* 1.4x10* 
Length of Transform 


(a) Relative processing rate for the 16MHz mode. 


CPU based implementation 



(b) Relative processing rate for the 32 MHz mode. 

Fig. 6 Processing rate of the CPU based coherent dedispersion system relative to the data acquisition rate 
for 16 MHz and 32 MHz modes, as a function of the length of the dedispersion transform. The dashed line 
indicates the rate required for a real-time pipeline. The rates shown are the mean rates from 1000 runs of 
the process, while the error bars are the standard deviations. 


a real-time application (with the given computing resources). The performance of 
the full pipeline is limited by the slowest routine in the process, i.e., coherent dedis¬ 
persion, and hence the full pipeline performs at roughly the same rate as the coher¬ 
ent dedispersion routine. Our benchmarks indicate that the full CPU-based pipeline 
performs 3.25 x slower (on average) than real-time data acquisition in the 16 MHz 
mode. Consequently, we have reserved our CPU-based implementation exclusively 
for offline processing of data, and developed a real-time pipeline using a GPU-based 
implementation of the algorithm. We describe the working of the real-time pipeline 
in the next section. 
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Performance of the routines in the CPU based impiementation 



Length of Transform 


Fig. 7 The number of GFLOPs achieved in run-time for the Inverse transform (CPU based FFTW on 
2 threads) and Coherent dedispersion routines (CPU based multi-threaded FFTW with 4 threads) as a 
function of the length of the dedispersion transform. The GFLOPs shown are the mean values from 1000 
runs of the process, while the error bars are the standard deviations. 


4 A real-time GPU based pipeline 

A real-time coherent dedispersion pipeline provides the ideal platform for harnessing 
the power of parallel computation on Graphics Processing Units (GPUs). For the cur¬ 
rent implementation, we have used a NVIDIA Tesla C2075 GPU, with 448 processor 
cores, each with a clock speed of 1.15 GHz, and a total internal memory of 6GB, 
mounted on the same host machine as used for the CPU based implementation. We 
present below our implementation of a real-time GPU based coherent dedispersion 
pipeline, which has been optimized such that the most compute intensive portions of 
the algorithm are executed on the GPU. The routines for the GPU-based algorithm 
have been coded in C under the NVIDIA CUDA™ programming environment. 


4.1 The cuFFT library 

Our benchmarks of the CPU-based implementation clearly indicate that the Coher¬ 
ent dedispersion routine is the slowest process in the pipeline, evidently due to the 
computation of the long FFT. The cuFFT library ( ifTSl ), available with the CUDA™ 
platform, provides routines to compute FFTs much faster than standard CPU-based 
FFT libraries, using the computational power of hundreds of processor cores available 
on standard GPUs. Our tests (explained later) clearly demonstrate the improvement 
in the performance of the dedispersion process when using the cuFFT library. We 
have based our routines on the double precision floating point cuFFT routines since 
we found that the use of single precision resulted in clearly visible artifacts in the 
dedispersed time series. Similar to the case of the CPU implementation, we believe 
that this is again due to the fact that the single precision floating point cuFFT rou¬ 
tines produce increasingly inaccurate results when computing longer FFTs, which are 
significantly larger than those produced by the double precision versions (see mi 














A Real-time Coherent Dedispersion Pipeline for the Giant Metrewave Radio Telescope 


15 


Double precision folded profile 



Phase (degrees) 


(a) Folded profile produced from the double precision version of the coherent dedispersion 
pipeline. 


Single precision folded profile 



Phase (degrees) 


(b) Folded profile produced from the single precision version of the coherent dedispersion pipeline. 
The additional artifacts are clearly visible in the profile. 


Fig. 8 Comparison of the folded profiles produced from the single precision and double precision versions 
of the coherent dedispersion pipeline. The profiles are that of the pulsar B0329+54 observed at 325 MHz, 
with a bandwidth of 32 MHz and a duration of 20 minutes with the GMRT. The profiles are normalized 
such that the entire pulse intensity range extends between 0 and 1. The phase range has been zoomed into 
the on pulse region. 


As a demonstration of the artifacts observed when using the single precision FFT 
routines (both in the CPU-based FFTW and the GPU-based cuFFT), we show in 
Figure coherently dedispersed folded profiles from an observation of the pulsar 
B0329H-54 at 325 MHz with the GMRT. Figure [^a) shows the coherently dedis¬ 
persed folded profile using the double precision mode of the pipeline, and is in good 
agreement with profiles published elsewhere (H]). Figure [^b) shows the same data 
coherently dedispersed in the single precision mode, where one clearly observes the 
additional artifacts produced in and around the main pulse region. 
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4.2 Optimization for the GPU 

The availability of thousands of threads on a GPU provides further opportunity to 
parallelize the dedispersion algorithm to improve its performance. The memory of 
the GPU and the host CPU are separate, unless one uses pinned memory on the host. 
A large amount of allocated pinned memory, however, degrades system performance 
since it leaves less memory available for paging. Since the dedispersion process re¬ 
quires computation on a large amount of data at once, this data has to be transferred to 
the GPU memory from the host memory before any computation can be performed. 

Memory transfers turn out to be relatively expensive as compared to kernel exe¬ 
cutions, and hence parallelizing memory transfers and kernel executions can improve 
the performance of an algorithm as compared to a serial implementation. CUBA™ 
allows for simultaneous memory transfers and kernel executions with the use of 
streams. For our algorithm, we have parallelized the memory transfers and kernel 
executions on the GPU such that at any iteration, there are 3 streams functioning on 
the GPU; 

- Transferring an already deconvolved block of data from the GPU memory to the 
host memory. 

- Computing the long FFT, applying the chirp function and then computing an in¬ 
verse FFT on the current data block to get deconvolved voltage data in GPU mem¬ 
ory. The chirp function is applied by initializing a kernel which applies the chirp 
function to each frequency channel in parallel by calling the required number of 
threads. 

- Asynchronously transferring the next data block to be processed from the host 
memory to GPU memory. 

The process is depicted in Figure]^ for a sequence of 4 consecutive data blocks (N 
point FFTs). 



Sequence of _ 
Execution L 




Fig. 9 Illustrating the optimization of the GPU code with concurrent memory transfers and kernel execu¬ 
tions by using streams. Arrows indicate the path of a data block from one stream to another. 
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4.3 The real-time algorithm 


The real-time coherent dedispersion algorithm is similar to the CPU based algorithm, 
with the exception of the long FFTs for the deconvolution being performed on the 
GPU to improve the processing rate of the pipeline. It operates on the following key 
principles: 


The entire process again logically divides into 3 different processes (as described 
in Section |3T), which can be run in parallel on a multi-processor system, but 
on consecutive blocks of data. The individual sections correspond to the inverse 
transform, coherent dedispersion and detection followed by writing to disk re¬ 
spectively. 

As described in Section 4.2 the process handling the computation in the GPU 
can be further sub-divided into 3 parallel processes, handling data transfer and 
computation inside the GPU. This is done to reduce the overhead of memory 
transfers to and from the GPU. Overall, this leads to a 5-way parallelism in the 
final algorithm. 

Transferring the coherent dedispersion process over to the GPU leaves a larger 
number of free threads in the CPU, which can be now used to speed up other 
processes in the pipeline. Consequently, the Inverse transform process now uses 
4 threads instead of 2 as in the CPU based implementation, which leads to a 
significant improvement in the processing rates (described below in Section 4.4 1 . 
The configuration of threads assigned to different processes has been restructured 
to allow optimal performance in the real-time pipeline by balancing the computa¬ 
tion load of different routines. As described below, the overall coherent dedisper¬ 
sion pipeline now uses only 6 of the 8 available threads, with the 2 free threads 
being left to accommodate 2 additional processes which are involved in real-time 
operations: the process writing the voltage beam data to shared memory, and the 
process synchronising the dedispersion process for the 2 polarisations of the beam 
(on different systems) via an MPI based routine. 


In the voltage beam mode of the GSB, the voltage data (in the form of FFT outputs 
from the original sampled voltage) is written into a shared memory on two different 
systems, each receiving the voltage data for two polarisations and synchronized with 
an MPI based routine to allow one to combine the signals from the two polarisations 
after detection. 

The real-time algorithm is depicted in Figure[^ There are again 3 routines which 
run in parallel, and simultaneously handle different consecutive blocks of data, which 
are transferred via shared memory. In this case, with the availability of streams on the 
GPU, we have designed an algorithm with 5-way parallelism to reduce the overhead 
of memory transfers to and from the GPU. The voltage beam data is read directly from 
the shared memory on both the systems which are receiving the data. The pipeline 
runs by performing the following tasks in parallel (the respective routine names are 
shown in bold): 

1. Inverse transform: (using 4 CPU threads) Inverse FFT on beam data for the i* 
block, by splitting the data block into 4 sections and inverse transforming the 4 
blocks in parallel. This allows the simultaneous utilisation of 4 threads on the 
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CPU (note that we could use only 2 threads for this process in the CPU-based 
implementation since the Coherent dedispersion routine used up 4 threads) to 
complete this task. The data is written to a shared memory double buffer. The use 
of the shared memory double buffer allows the next routine in the pipeline to read 
from one segment in the shared memory while the next block is still being written 
into the other segment simultaneously. 

2. Coherent dedispersion; (using 1 CPU thread) 

(a) Copy (i-1)* block from shared memory to GPU memory 

(b) Perform long FFT on (i-2)*, apply chirp function and then produce dedis- 
persed time series with appropriate overlap. For applying the chirp function, 
N/2 threads of a particular kernel (corresponding to the choice of a particu¬ 
lar taper function) are initialised, corresponding to each of the N/2 frequency 
channels produced by the forward FFT in the GPU. 

(c) Copy (i-3)* block from GPU to host shared memory in a double buffer. 

The use of this 3 way parallelism inside the GPU allows us to eliminate overheads 
of data transfer to and from the GPU, by simultaneously processing and copying 
different segments of data from the voltage beam. 

The overlap-save method of computing convolutions requires one to reject a cer¬ 
tain number of samples at the end of a given block, and overlap the same length 
with the previous block (corresponding to the length of the impulse response). 
Hence, with every input block of the long FFT, one loses samples from the end of 
the block. To ensure continuity in the time series, we have implemented the GPU 
routine such that the number of samples lost in this process is then prefixed to the 
next block in the input data stream. 

3. Write to disk; (using 1 CPU thread) Read deconvolved voltage data from the 
shared memory double buffer. Square, integrate to certain number of samples, and 
write intensity to a disk. Again, the use of the double buffer allows this detection 
routine to read from a shared memory segment simultaneously while the next 
segment is being written to by the GPU routine Coherent dedispersion. 

The DSPSR (||23) also uses some of the techniques we have used in our imple¬ 
mentation, including the use of ring buffers in shared memory for real-time signal 
processing. Further, the thread-safe input buffering technique in DSPSR is similar 
to the technique we have used to account for the samples lost in the overlap-save 
method of computing convolutions. However, the process of initially inverse trans¬ 
forming the data is an additional task that is required for the GMRT voltage beam and 
has been a part of our implementation. Further, since DSPSR was designed primarily 
to handle larger bandwidths at higher frequencies, it uses the filter-bank scheme to 
split the full band into narrower bands on which coherent dedispersion is performed. 
Since we have relatively smaller bandwidths to deal with at the GMRT, we do not use 
sub-banding which, in principle, allows us to achieve higher time resolutions. 


4.4 Performance of the GPU-based pipeline 

Similar to the analysis done for the CPU based implementation described in Section 
|3.2[ we have benchmarked the performance of our GPU-based real-time pipeline, us- 
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Fig. 10 The real-time coherent dedispersion algorithm. Block i is the latest block from the file / shared 
memory. Each of the 3 routines mentioned in the text run in parallel, utilizing a certain number of threads 
on the CPU (highlighted in red). The Coherent dedisperison routine invokes the computation on the GPU, 
whereas the other two routines run only on the CPU. Their tasks are depicted in the respective boxes 
indicated with dashed lines. 


ing the same equations to measure the processing rates. The improvements clearly 
show up as a significant increase in the computation rate, and in the number of 
GFLOPs achieved in run-time. The results are shown in Figures [TT| and [T^ 

The performance of the GPU-based pipeline shows improvements on a number 
of fronts as compared to the CPU based implementation. Firstly, the processing rate 
of the Inverse Transform routine shows about a factor of 3 improvement in process¬ 
ing rate as compared to the CPU-based implementation (and overall 3 times faster 
than real-time data acquisition rate in 32 MHz mode), both in time taken and the 
number of GFlops achieved. Though this routine still runs on the CPU, this marked 
improvement arises due to the routine now using 4 threads, instead of 2 in the CPU- 
only based implementation. The most compute intensive portion of the algorithm, i.e. 
the coherent dedispersion routine, shows a factor of about 9 improvement in process¬ 
ing rate (and overall 1.5 times faster than real-time data acquisition rate in 32 MHz 
mode) on the GPU, which was what we originally aimed to achieve by moving our 
implementation to a GPU. 

Clearly, porting over the algorithm to a GPU has now allowed us to achieve bet¬ 
ter than real-time processing rates for the coherent dedispersion pipeline. Our bench¬ 
marks indicate that the optimized full pipeline can perform coherent dedispersion 1.4 
X faster (on average) than real-time data acquisition in the 32 MHz mode and 2.5 x 
faster (on average) than real-time data acquisition in the 16 MHz mode. The process¬ 
ing rate of the pipeline is thus well-suited for a real-time application, which has been 
subsequently verified in real-time tests. We present these results in the next section. 


4.5 Sample results from the real-time pipeline 

The coherent dedispersion pipeline has been tested in real-time at the GMRT, running 
in parallel on two systems with identical specifications, each receiving one polarisa- 
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GPU based implementation 

Processing rates of the routines in 16 MHz mode 



(a) Relative processing rate for the 16MHz mode. 


GPU based implementation 

Processing rates of the routines in 32 MHz mode 



(b) Relative processing rate for the 32 MHz mode. 

Fig. 11 Processing rate of the GPU based coherent dedispersion algorithm relative to the data acquisition 
rate for 16 MHz and 32 MHz modes, as a function of the length of the dedispersion transform. The dashed 
line indicates the rate required for a real-time pipeline. The rates shown are the mean rates from 1000 runs 
of the process, while the en'or bars are the standard deviations. 


tion of the voltage beam data. We present here the first results obtained from the real¬ 
time pipeline. In Figure [T^ (a), the folded profile for the pulsar B0329H-54 observed 
at 325 MHz with the real-time coherent dedispersion pipeline is shown. Figure [T3l(b) 
shows a sample single pulse from the same data set, exhibiting clear evidence for the 
occurence of micropulses. Figure [T4| shows folded profiles for the pulsars B0833-45 
and B2016H-28 (zoomed into the on pulse region). The observations parameters are 
given in Table [T] and have also been mentioned in the figure captions. The folded 
profiles agree with the ones available in the databases provided in [H and llT2l . 

These tests and sample results show that the pipeline works stably, maintaining 
real-time performance without losing any blocks of data - segments of the voltage 
time series which are written as individual blocks in the shared memory by the GSB. 

































A Real-time Coherent Dedispersion Pipeline for the Giant Metrewave Radio Telescope 


21 


Performance of the routines in the GPU based implementation 



0 1x10' 2x10' 3x10' 4x10' 5x10' 6x10' 7x10' 

Length of Transform 


Fig. 12 The number of GFLOPs achieved in run-time for the Inverse transform (CPU based FFTW on 
4 threads) and Coherent dedispersion routines (GPU based computation with the cuFFT routines) as a 
function of the length of the dedispersion transform. The GFLOPs shown are the mean values from 1000 
runs of the process, while the error bars are the standard deviations. 


If the pipeline does not process data fast enough, blocks of data would be lost in real¬ 
time analysis, leading to gaps in the observed time series. These gaps would show up 
as multiple pulse features at different phases in the folded profile, which is not the 
case. 

Currently, the range of DMs for a given frequency of observation that can be 
dedispersed in real-time using our pipeline are limited only by the memory of the 
GPU. For the observing frequency of 610 MHz, the highest DM that can be dedis¬ 
persed is « 3000 pc/cc and « 750 pc/cc for the 16 MHz and 32 MHz modes respec¬ 
tively. For the observing frequency of 325 MHz, the corresponding maximum DMs 
are « 480 pc/cc and « 120 pc/cc for the 16 MHz and 32 MHz modes respectively. 


Figure 

Observation 

Frequency 

(MHz) 

Bandwidth 

(MHz) 

Pulsar 

Name 

Pulsar 
Period (ms) 

DM 

(pc/cc) 

Duration of 
observation 

12(a) 

325 

32 

B0329+54 

714.5 

26.77 

900 s 

12(b) 

325 

32 

B0329+54 

714.5 

26.77 

19.8 ms 

13(a) 

235 

16 

B0833-45 

89.3 

67.99 

300 s 

13(b) 

325 

32 

B2016+28 

557.9 

14.172 

900 s 


Table 1 Observation parameters for the tests carried out with the coherent dedispersion pipeline 


5 Conclusions and Future Prospects 

We have developed a real-time coherent dedispersion pipeline for the voltage beam 
mode of the GMRT using a GPU-based implementation of the algorithm. Our bench¬ 
marks indicate that, using a Tesla C2075 GPU, the pipeline is able to perform coher¬ 
ent dedispersion at better than real-time data acquisition rates for the currently avail- 




22 


Kishalay De, Yashwant Gupta 



-20 -15 -10 -5 0 5 10 15 20 


Phase (degrees) 


(a) The folded profile of the pulsar B0329+54. The profile has been zoomed into the on-pulse 
region. 


single pulse profile 



(b) A single pulse from the data stream corresponding to the folded profile in (a), clearly showing 
small time-scale structure in the pulse intensity. The figure has been zoomed into the region of the 
central component of the pulse in phase. 


Fig. 13 Results obtained from the real-time coherent dedispersion pipeline for the pulsar B0329+54, ob¬ 
served at 325 MHz with a bandwidth of 32 MHz for a duration of 15 minutes. The pulsar has a DM of 
26.77 pc/cc and a period of 714.5 ms. Time resolution used is 15 jis. Fig. (a) shows the folded profile, 
whereas Fig. (b) shows a sample single pulse from the data. 


able bandwidths at the GMRT, of 16 MHz and 32 MHz, by factors of approximately 
2.5 & 1.4, respectively. This real-time pipeline has been tested fairly extensively and 
is found to work reliably, without losing any data. A CPU based implementation of 
the pipeline has also been developed and released as an offline processing package, 
along with an offline version of the GPU-based implementation. The first results from 
the pipeline clearly demonstrate the potential of this system as a powerful tool for low 
frequency pulsar observations at the GMRT, covering aspects such as studies of pul- 
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Folded profile for B0833-45 - 



-150 -100 -50 0 50 100 150 

Phase (degrees) 


(a) Folded profile of the pulsar B0833-45 observed at 235 MHz with a bandwidth of 16 MHz for 
a duration of 5 minutes. The pulsar has a DM of 67.99 pc/cc and a period of 89.3 ms. Owing to 
the high DM and low frequency of observation, the scattering tail is very prominent and extends 
throughout the profile. Time resolution used is 30 fls. 


Folded profile 



(b) Folded profile of the pulsar B2016+28 observed at 325 MHz with a bandwidth of 32 MHz for 
a duration of 15 minutes. The profile has been zoomed into the on-pulse region. The pulsar has a 
DM of 14.172 pc/cc and a period of 557.9 ms. Time resolution used is 15 fls. 

Fig. 14 Folded profiles for the pulsars B0833-45 and B2016+28 obtained with the real-time coherent 
dedispersion pipeline. 


sar microstructure, accurate profiles of millisecond pulsars at low radio frequencies 
and development of a high fidelity pulsar timing machine. 

Our coherent dedispersion pipeline has some unique features as compared to ex¬ 
isting coherent dedispersion systems (described in Section o ; Firstly, our system 
works completely in real-time, without the requirement of recording data on disk (as 
implemented in PuMa-II). Further, our system does not split the bandwidth of obser¬ 
vation to reduce the computational load of the process, which allows one to achieve 
higher time resolutions than that obtainable by sub-banding. Also, we have imple¬ 
mented the process with transform lengths much larger than the ones explored by 
other systems (e.g. GUPPI), as the impulse response duration of the ISM is much 
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longer at the low frequencies our pipeline operates at. This is thus one of the first 
cases of the development of a real-time coherent dedispersion pipeline for a large, 
low frequency multi-element radio telescope like the GMRT. 

Our current design, matched to the existing GMRT system, is easily extensible 
to a larger bandwidth system. In particular, we plan to extend the implementation of 
this pipeline to the GMRT Wideband Back-end (GWB) that is being developed for the 
upgraded GMRT. For the large bandwidths of operation of the GWB (100, 200 and 
400 MHz), recording full time resolution voltage data (if possible at all) will require 
a prohibitively large amount of disk storage space, even for relatively small duration 
observations. In such a case, a real-time pipeline for coherent dedispersion becomes 
a necessity, if one wants to fully exploit the wider bandwidths of the upgraded GMRT 
and make it an outstanding instrument to study pulsars at low frequencies. This should 
be fairly easy to do, as we show below. 

The computational complexity for implementing the real-time coherent dedisper¬ 
sion pipeline stems from two aspects : the first is that the length of the FFTs involved 
in the deconvolution increases as BW^ (where BW is the bandwidth of the obser¬ 
vations), since both the sampling rate and dispersion time across the band increase 
approximately linearly with BW. This would amount to a factor of 100 times in¬ 
crease over that of the current GSB implementation, and one may run into memory 
size limitations on the GPU. This problem can be partially overcome by the trade¬ 
off of splitting the wideband into a few narrower sub-bands, the coherent dedisper¬ 
sions of each of which can fit within the memory constraints of the GPU, and then 
combining the coherently dedispersed signals from all the sub-bands to get a time 
resolution that is the inverse of the bandwidth of one sub-band. The second is the 
fact that even for a given length of the FFT, the total computational load would go 
up as BW * log{BW), implying a factor of approximately 30 increase in the required 
computation rate. Since the current system is built on a relatively old GPU (Tesla 
C2075), it should be possible to overcome both these challenges with the future gen¬ 
erations of the GPUs that are expected to become available in the near future, and 
thus implement a real-time coherent dedispersion system for the upgraded GMRT. 
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